Businesses in the United States face high demand for human resources, but identifying and attracting the right talent remains a constant challenge, and it is perhaps the most important element in staying competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally and abroad.
The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).
OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.
In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.
The increasing number of applicants every year calls for a Machine Learning-based solution that can help shortlist candidates with higher chances of visa approval. OFLC has hired your firm EasyVisa for data-driven solutions. As a data scientist, you have to analyze the data provided and, with the help of a classification model, predict whether a visa application is likely to be certified or denied.
The data contains the different attributes of the employee and the employer. The detailed data dictionary is given below.
# this will help in making the Python code more structured automatically (good coding practice)
!pip install black[jupyter] --quiet
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import display
from matplotlib.ticker import MaxNLocator
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
import scipy.stats as stats
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
classification_report,
roc_auc_score,
#plot_confusion_matrix,
precision_recall_curve,
roc_curve,
make_scorer
)
custom = {"axes.edgecolor": "purple", "grid.linestyle": "solid", "grid.color": "black"}
sns.set_style("dark", rc=custom)
#format numeric data for easier readability
pd.set_option("display.float_format", lambda x: "{:.2f}".format(x)) # to display numbers rounded off to 2 decimal places
%matplotlib inline
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import (BaggingClassifier,RandomForestClassifier,
GradientBoostingClassifier, AdaBoostClassifier, StackingClassifier)
# To tune different models
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn import metrics
#To install xgboost library use - !pip install xgboost
!pip install xgboost
from xgboost import XGBClassifier
Requirement already satisfied: xgboost in /usr/local/lib/python3.10/dist-packages (2.0.3) Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from xgboost) (1.25.2) Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from xgboost) (1.11.4)
# let colab access my google drive
from google.colab import drive
drive.mount("/content/drive")
Mounted at /content/drive
# Loading the dataset from the mounted Google Drive
df = pd.read_csv("/content/drive/MyDrive/Python_Course/Project_5/EasyVisa.csv")
df.head()
| case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EZYV01 | Asia | High School | N | N | 14513 | 2007 | West | 592.20 | Hour | Y | Denied |
| 1 | EZYV02 | Asia | Master's | Y | N | 2412 | 2002 | Northeast | 83425.65 | Year | Y | Certified |
| 2 | EZYV03 | Asia | Bachelor's | N | Y | 44444 | 2008 | West | 122996.86 | Year | Y | Denied |
| 3 | EZYV04 | Asia | Bachelor's | N | N | 98 | 1897 | West | 83434.03 | Year | Y | Denied |
| 4 | EZYV05 | Africa | Master's | Y | N | 1082 | 2005 | South | 149907.39 | Year | Y | Certified |
df.tail()
| case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25475 | EZYV25476 | Asia | Bachelor's | Y | Y | 2601 | 2008 | South | 77092.57 | Year | Y | Certified |
| 25476 | EZYV25477 | Asia | High School | Y | N | 3274 | 2006 | Northeast | 279174.79 | Year | Y | Certified |
| 25477 | EZYV25478 | Asia | Master's | Y | N | 1121 | 1910 | South | 146298.85 | Year | N | Certified |
| 25478 | EZYV25479 | Asia | Master's | Y | Y | 1918 | 1887 | West | 86154.77 | Year | Y | Certified |
| 25479 | EZYV25480 | Asia | Bachelor's | Y | N | 3195 | 1960 | Midwest | 70876.91 | Year | Y | Certified |
df.shape
(25480, 12)
There are 25480 rows and 12 columns in the dataset.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 25480 entries, 0 to 25479 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 case_id 25480 non-null object 1 continent 25480 non-null object 2 education_of_employee 25480 non-null object 3 has_job_experience 25480 non-null object 4 requires_job_training 25480 non-null object 5 no_of_employees 25480 non-null int64 6 yr_of_estab 25480 non-null int64 7 region_of_employment 25480 non-null object 8 prevailing_wage 25480 non-null float64 9 unit_of_wage 25480 non-null object 10 full_time_position 25480 non-null object 11 case_status 25480 non-null object dtypes: float64(1), int64(2), object(9) memory usage: 2.3+ MB
case_id, continent, education_of_employee, has_job_experience, requires_job_training, region_of_employment, unit_of_wage, full_time_position, and case_status have the object dtype.
no_of_employees and yr_of_estab are integers.
prevailing_wage is a float.
cols = df.select_dtypes(['object'])
cols.columns
Index(['case_id', 'continent', 'education_of_employee', 'has_job_experience',
'requires_job_training', 'region_of_employment', 'unit_of_wage',
'full_time_position', 'case_status'],
dtype='object')
for i in cols.columns:
df[i] = df[i].astype('category')
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 25480 entries, 0 to 25479 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 case_id 25480 non-null category 1 continent 25480 non-null category 2 education_of_employee 25480 non-null category 3 has_job_experience 25480 non-null category 4 requires_job_training 25480 non-null category 5 no_of_employees 25480 non-null int64 6 yr_of_estab 25480 non-null int64 7 region_of_employment 25480 non-null category 8 prevailing_wage 25480 non-null float64 9 unit_of_wage 25480 non-null category 10 full_time_position 25480 non-null category 11 case_status 25480 non-null category dtypes: category(9), float64(1), int64(2) memory usage: 2.0 MB
Updated the object columns to the category dtype.
We can see that the memory usage has decreased from 2.3 MB to 2.0 MB; this technique is especially useful for bigger datasets.
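The saving comes from storing each distinct label once and keeping only small integer codes per row. A minimal sketch on a toy Series (not the EasyVisa data) to illustrate the effect:

```python
import pandas as pd

# A low-cardinality string column stored as object vs. category
s_obj = pd.Series(["Certified", "Denied"] * 50_000, dtype="object")
s_cat = s_obj.astype("category")

mem_obj = s_obj.memory_usage(deep=True)
mem_cat = s_cat.memory_usage(deep=True)

print(f"object:   {mem_obj:,} bytes")
print(f"category: {mem_cat:,} bytes")
# category stores int codes plus a tiny dictionary of labels,
# so it is much smaller whenever the number of distinct values is low.
```

The fewer distinct values a column has relative to its length, the larger the saving; this is why converting case_id (all unique) buys little, while converting case_status (two values) helps.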
# Copy data to avoid any changes to the original data
df2 = df.copy()
# Use info() to print a concise summary of the DataFrame
df2.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 25480 entries, 0 to 25479 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 case_id 25480 non-null category 1 continent 25480 non-null category 2 education_of_employee 25480 non-null category 3 has_job_experience 25480 non-null category 4 requires_job_training 25480 non-null category 5 no_of_employees 25480 non-null int64 6 yr_of_estab 25480 non-null int64 7 region_of_employment 25480 non-null category 8 prevailing_wage 25480 non-null float64 9 unit_of_wage 25480 non-null category 10 full_time_position 25480 non-null category 11 case_status 25480 non-null category dtypes: category(9), float64(1), int64(2) memory usage: 2.0 MB
df2.duplicated().sum()
0
df2.isnull().sum()
case_id 0 continent 0 education_of_employee 0 has_job_experience 0 requires_job_training 0 no_of_employees 0 yr_of_estab 0 region_of_employment 0 prevailing_wage 0 unit_of_wage 0 full_time_position 0 case_status 0 dtype: int64
# checking for unique values
df2.nunique()
case_id 25480 continent 6 education_of_employee 4 has_job_experience 2 requires_job_training 2 no_of_employees 7105 yr_of_estab 199 region_of_employment 5 prevailing_wage 25454 unit_of_wage 4 full_time_position 2 case_status 2 dtype: int64
df2.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| no_of_employees | 25480.00 | 5667.04 | 22877.93 | -26.00 | 1022.00 | 2109.00 | 3504.00 | 602069.00 |
| yr_of_estab | 25480.00 | 1979.41 | 42.37 | 1800.00 | 1976.00 | 1997.00 | 2005.00 | 2016.00 |
| prevailing_wage | 25480.00 | 74455.81 | 52815.94 | 2.14 | 34015.48 | 70308.21 | 107735.51 | 319210.27 |
The average no_of_employees is 5,667, with values ranging from -26 to 602,069; a negative employee count is clearly a data error.
yr_of_estab ranges from 1800 to 2016.
The average prevailing_wage is 74,455.81, with values ranging from 2.14 to 319,210.27.
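The minimum no_of_employees of -26 is impossible, so such rows need handling before modeling. A minimal sketch on a toy frame, assuming the negatives are sign-entry errors (taking the absolute value is one common treatment; dropping the rows is an alternative):

```python
import pandas as pd

# Toy frame standing in for df2; the real column has a minimum of -26.
toy = pd.DataFrame({"no_of_employees": [-26, 98, 2412, -15, 14513]})

# Count the impossible values first
n_negative = (toy["no_of_employees"] < 0).sum()
print(f"Negative employee counts: {n_negative}")

# Assumed fix: treat negatives as sign-entry errors and flip the sign
toy["no_of_employees"] = toy["no_of_employees"].abs()
```

Whichever treatment is chosen, it should be applied before the train/test split so the cleaned feature is consistent everywhere.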
df2.describe(include=['category']).T
| count | unique | top | freq | |
|---|---|---|---|---|
| case_id | 25480 | 25480 | EZYV01 | 1 |
| continent | 25480 | 6 | Asia | 16861 |
| education_of_employee | 25480 | 4 | Bachelor's | 10234 |
| has_job_experience | 25480 | 2 | Y | 14802 |
| requires_job_training | 25480 | 2 | N | 22525 |
| region_of_employment | 25480 | 5 | Northeast | 7195 |
| unit_of_wage | 25480 | 4 | Year | 22962 |
| full_time_position | 25480 | 2 | Y | 22773 |
| case_status | 25480 | 2 | Certified | 17018 |
Applicants are coming from 6 continents.
The education level most frequent for the applicants is the Bachelor's degree. The levels are High School, Bachelor's, Master's and Doctorate.
58% of the applicants have job experience.
88% of applicants do not require job training.
There are 5 regions of employment.
The applicant's unit of wage will be by Hour, Week, Month or Year.
89% of applicants are offered a full-time position.
67% of visa applications in the dataset are certified.
case_status is the target variable.
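The percentages quoted above follow directly from the frequency table; dividing each top-category count by the 25,480 rows reproduces them (on the real data, `value_counts(normalize=True)` gives the same result):

```python
total = 25480  # number of rows in the dataset

# Top-category counts taken from the summary table above
pct_experience = 14802 / total   # has_job_experience == "Y"
pct_no_training = 22525 / total  # requires_job_training == "N"
pct_certified = 17018 / total    # case_status == "Certified"

print(f"{pct_experience:.0%}, {pct_no_training:.0%}, {pct_certified:.0%}")
# → 58%, 88%, 67%
```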
Dropping case_id, which is a unique identifier and adds no predictive information.
df2.drop(['case_id'],axis=1,inplace=True)
Let's look at the unique values of all the categories
cols_cat= df2.select_dtypes(['category'])
cols_cat
| continent | education_of_employee | has_job_experience | requires_job_training | region_of_employment | unit_of_wage | full_time_position | case_status | |
|---|---|---|---|---|---|---|---|---|
| 0 | Asia | High School | N | N | West | Hour | Y | Denied |
| 1 | Asia | Master's | Y | N | Northeast | Year | Y | Certified |
| 2 | Asia | Bachelor's | N | Y | West | Year | Y | Denied |
| 3 | Asia | Bachelor's | N | N | West | Year | Y | Denied |
| 4 | Africa | Master's | Y | N | South | Year | Y | Certified |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25475 | Asia | Bachelor's | Y | Y | South | Year | Y | Certified |
| 25476 | Asia | High School | Y | N | Northeast | Year | Y | Certified |
| 25477 | Asia | Master's | Y | N | South | Year | N | Certified |
| 25478 | Asia | Master's | Y | Y | West | Year | Y | Certified |
| 25479 | Asia | Bachelor's | Y | N | Midwest | Year | Y | Certified |
25480 rows × 8 columns
for i in cols_cat.columns:
print('Unique values in',i, 'are :')
print(cols_cat[i].value_counts())
print('*'*50)
Unique values in continent are : continent Asia 16861 Europe 3732 North America 3292 South America 852 Africa 551 Oceania 192 Name: count, dtype: int64 ************************************************** Unique values in education_of_employee are : education_of_employee Bachelor's 10234 Master's 9634 High School 3420 Doctorate 2192 Name: count, dtype: int64 ************************************************** Unique values in has_job_experience are : has_job_experience Y 14802 N 10678 Name: count, dtype: int64 ************************************************** Unique values in requires_job_training are : requires_job_training N 22525 Y 2955 Name: count, dtype: int64 ************************************************** Unique values in region_of_employment are : region_of_employment Northeast 7195 South 7017 West 6586 Midwest 4307 Island 375 Name: count, dtype: int64 ************************************************** Unique values in unit_of_wage are : unit_of_wage Year 22962 Hour 2157 Week 272 Month 89 Name: count, dtype: int64 ************************************************** Unique values in full_time_position are : full_time_position Y 22773 N 2707 Name: count, dtype: int64 ************************************************** Unique values in case_status are : case_status Certified 17018 Denied 8462 Name: count, dtype: int64 **************************************************
Leading Questions:
Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification?
How does the visa status vary across different continents?
Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status?
In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa?
The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(df2, feature, figsize=(12, 7), kde=False, bins=None):
"""
Creates a combined boxplot and histogram for a given feature in the dataset.
Args:
df2: The input dataframe.
feature (str): The column name for which to create the plot.
figsize (tuple, optional): Size of the figure (default: (12, 7)).
kde (bool, optional): Whether to show the density curve (default: False).
bins (int, optional): Number of bins for the histogram (default: None).
Returns:
None (displays the plot)
"""
fig, (ax_box, ax_hist) = plt.subplots(
nrows=2,
sharex=True,
figsize=figsize,
gridspec_kw={"height_ratios": (0.25, 0.75)},
)
# Boxplot
sns.boxplot(data=df2, x=feature, ax=ax_box, showmeans=True, color="#F72585")
# Histogram
if bins is None:
unique_values = df2[feature].unique()
bins = np.linspace(unique_values.min() - 1, unique_values.max() + 2, num=25)
sns.histplot(data=df2, x=feature, bins=bins, kde=kde, ax=ax_hist)
# Add mean and median lines
ax_hist.axvline(df2[feature].mean(), color="purple", linestyle="--", label="Mean")
ax_hist.axvline(df2[feature].median(), color="blue", linestyle="-", label="Median")
# Label each bar with its count
for j, p in enumerate(ax_hist.patches):
ax_hist.annotate(
f"{int(p.get_height())}",
(p.get_x() + p.get_width() / 2.0, p.get_height()),
ha="center",
va="center",
xytext=(1, 10),
textcoords="offset points",
)
ax_hist.legend()
ax_hist.set_xlabel(feature)
ax_hist.set_ylabel("Frequency")
ax_hist.set_title(f"Frequency of {feature}")
plt.tight_layout()
# function to create labeled barplots
def labeled_barplot(df2, feature, order, perc=False, n=None):
"""
Barplot with percentage at the top
df2: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(df2[feature]) # length of the column
count = df2[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 10))
else:
plt.figure(figsize=(n + 1, 10))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=df2,
x=feature,  # categorical column to count
palette="cubehelix",  # color palette for the bars
legend=False,  # no hue variable, so no legend is needed
order=df2[feature].value_counts().index[:n].sort_values(),
)
# Annotate each bar with its count and percentage
for p in ax.patches:
prc = "{:.1f}%".format(100.0 * p.get_height() / total) # percentage
cnt = p.get_height() # count
xx = p.get_x() + p.get_width() / 2 # x coordinate of bar percentage label
yy = p.get_height() # y coordinate of bar percentage label
# Annotate percentage
ax.annotate(
prc,
(xx, yy),
ha="center",
va="center",
style="italic",
size=12,
xytext=(0, 10),
textcoords="offset points",
)
# Annotate count (adjust vertical position)
ax.annotate(
cnt,
(xx, yy + 100),
ha="center",
va="bottom", # Adjusted to display above the percentage label
size=12,
xytext=(0, 20),
textcoords="offset points",
)
# Increase y-axis size by 500
plt.ylim(0, ax.get_ylim()[1] + 500)
# function to create labeled barplots
def labeled_barplot2(df2, feature, feature_2, order, perc=False, n=None):
"""
Barplot with percentage at the top
df2: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(df2[feature]) # length of the column
count = df2[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 20))
else:
plt.figure(figsize=(n + 1, 20))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=df2,
x=feature,  # categorical column to count
palette="cubehelix",  # color palette for the bars
legend=True,  # show the legend for the hue variable
order=df2[feature].value_counts().index[:n].sort_values(),
hue=feature_2,
)
# Annotate each bar with its count and percentage
for p in ax.patches:
prc = "{:.1f}%".format(100.0 * p.get_height() / total) # percentage
cnt = p.get_height() # count
xx = p.get_x() + p.get_width() / 2 # x coordinate of bar percentage label
yy = p.get_height() # y coordinate of bar percentage label
# Annotate percentage
ax.annotate(
prc,
(xx, yy),
ha="center",
va="center",
style="italic",
size=12,
xytext=(0, 10),
textcoords="offset points",
)
# Annotate count (adjust vertical position)
ax.annotate(
cnt,
(xx, yy + 100),
ha="center",
va="bottom", # Adjusted to display above the percentage label
size=12,
xytext=(0, 20),
textcoords="offset points",
)
# Increase y-axis size by 500
plt.ylim(0, ax.get_ylim()[1] + 500)
def stacked_barplot(df2, predictor, target, palette=None):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
palette: list of colors (optional)
"""
count = df2[predictor].nunique()
sorter = df2[target].value_counts().index[-1]
# Use a custom palette or default to Matplotlib's default colors
if palette:
colors = palette
else:
# Default colors (you can replace these with your own)
colors = ["#06C2AC", "#9A0EEA", "#ED0DD9", "#0000BB", "#DC143C"]
#Colors are Teal, Violet, Fuchsia, Navy, and Crimson
tab1 = pd.crosstab(df2[predictor], df2[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(df2[predictor], df2[target], normalize="index").sort_values(
by=sorter, ascending=False
)
# Plot using the specified colors
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5), color=colors)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(df2, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
target_uniq = df2[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=df2[df2[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="aqua",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=df2[df2[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="indigo",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=df2, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=df2,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="plasma",
)
plt.tight_layout()
plt.show()
# Plot the histogram and boxplot (the function creates its own figure and labels)
histogram_boxplot(df2, "no_of_employees", figsize=(20, 6))
df2["no_of_employees"].value_counts()
print()
df2["no_of_employees"].describe().T
no_of_employees
183 18
854 16
724 16
766 15
1476 15
..
5876 1
5536 1
47866 1
4700 1
40224 1
Name: count, Length: 7105, dtype: int64
count 25480.00 mean 5667.04 std 22877.93 min -26.00 25% 1022.00 50% 2109.00 75% 3504.00 max 602069.00 Name: no_of_employees, dtype: float64
no_of_employees is right-skewed.
The average no_of_employees is 5,667.
Roughly 25,211 of the 25,480 companies (~99%) have fewer than 100,000 employees.
There are several outliers.
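Given this heavy right skew, a log transform is a common way to make such a distribution easier to inspect. A minimal sketch with `np.log1p` on toy values at the scale of no_of_employees (a visualization aid, not a modeling recommendation):

```python
import numpy as np

# Right-skewed toy values on the scale of no_of_employees
vals = np.array([98, 1022, 2109, 3504, 14513, 602069], dtype=float)

logged = np.log1p(vals)  # log(1 + x), safe at x = 0

# The transform compresses the long right tail:
# the max/median ratio shrinks dramatically after logging.
ratio_raw = vals.max() / np.median(vals)
ratio_log = logged.max() / np.median(logged)
print(f"raw ratio: {ratio_raw:.1f}, logged ratio: {ratio_log:.2f}")
```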
# Plot the histogram and boxplot (the function creates its own figure and labels)
histogram_boxplot(df2, "prevailing_wage", figsize=(20, 6))
df2["prevailing_wage"].value_counts()
print()
df2["prevailing_wage"].describe().T
prevailing_wage
82560.28 2
122.65 2
60948.15 2
64357.58 2
108.12 2
..
25713.98 1
101656.64 1
65665.55 1
50.88 1
70876.91 1
Name: count, Length: 25454, dtype: int64
count 25480.00 mean 74455.81 std 52815.94 min 2.14 25% 34015.48 50% 70308.21 75% 107735.51 max 319210.27 Name: prevailing_wage, dtype: float64
The mean and median are both between 70-75K.
There are several outliers.
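"Several outliers" can be made precise with the 1.5 × IQR fences that the boxplot itself draws. A minimal sketch on toy wage values (on the real data, replace the array with `df2["prevailing_wage"].to_numpy()`):

```python
import numpy as np

# Toy wage sample standing in for df2["prevailing_wage"]
wages = np.array([2.14, 34015.48, 70308.21, 107735.51, 319210.27, 72000.0])

# Tukey fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier
q1, q3 = np.percentile(wages, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = wages[(wages < lower) | (wages > upper)]
print(f"fences: [{lower:,.2f}, {upper:,.2f}], outliers: {outliers}")
```

Counting the points outside the fences on the full column quantifies how many "several" really is before deciding whether to cap or keep them.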
# Labeled barplot for continent
labeled_barplot(df2, "continent", order=None, perc=True, n=25)
Most applicants (66.2%) come from Asia. In descending order of applicants, the continents are Asia, Europe, North America, South America, Africa, and Oceania.
# Labeled barplot for type of education of employee
labeled_barplot(df2, "education_of_employee", order=["High School","Bachelor's","Master's","Doctorate"], perc=True, n=25)
40.2 % of applicants have a Bachelor's degree.
37.8% of applicants have a Master's degree.
13.4% of applicants only have a high school diploma.
8.6% of applicants have their Doctorate.
# Labeled barplot for has_job_experience
labeled_barplot(df2, "has_job_experience", order = None, perc=True, n=25)
58.1% of applicants have job experience, whereas 41.9% do not have job experience.
# Labeled barplot for requires_job_training
labeled_barplot(df2, "requires_job_training", order=None, perc=True, n=25)
88.4% of applicants do not require job training.
11.6% of the applicants will need job training.
# Labeled barplot for yr_of_estab
labeled_barplot(df2, "yr_of_estab", order=None, perc=True, n=25)
The graph shows the top 25 years.
1998 saw the most companies founded in a single year, yet those account for only 4.5% of companies.
Founding years are spread across many values, each with a small count.
The age of the company does not appear to make much difference based on this graph, so we will consider dropping this feature.
# Labeled barplot for region_of_employment
labeled_barplot(df2, "region_of_employment", order=None, perc=True, n=25)
There is not a huge difference between the top 3 regions.
The bottom 2 regions account for 18.4% of all the jobs regions.
# Labeled barplot for unit_of_wage
labeled_barplot(df2, "unit_of_wage", order=None, perc=True, n=25)
Most jobs (90.1%) pay an annual salary.
8.5% are hourly jobs.
Weekly and monthly salaries together account for only 1.4% of all jobs.
# Labeled barplot for full_time_position
labeled_barplot(df2, "full_time_position", order=None, perc=True, n=25)
Most of the jobs are full-time positions.
Full-time jobs account for 89.4%, whereas non-full-time jobs account for only 10.6%.
# Labeled barplot for case_status
labeled_barplot(df2, "case_status", order=None, perc=True, n=25)
Certified visas account for 66.8% of all applications.
Denied visas account for 33.2% of all applications.
plt.figure(figsize=(20, 10))
# Select only numeric columns
numeric_data = df2.select_dtypes(include=[float, int])
# Compute correlation matrix
corr_matrix = numeric_data.corr()
# Plot heatmap
sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, fmt='.2f', cmap="viridis")
plt.show();
There are no notable correlations among the numeric fields.
stacked_barplot(df2, "education_of_employee", "case_status")
case_status Certified Denied All education_of_employee All 17018 8462 25480 Bachelor's 6367 3867 10234 High School 1164 2256 3420 Master's 7575 2059 9634 Doctorate 1912 280 2192 ------------------------------------------------------------------------------------------------------------------------
Applicants having a Doctorate have the highest certified rate. Doctorate applications are certified 87.23% of the time and denied 12.77% of the time.
Applicants who have a Masters are certified 78.63% of the time and denied 21.37% of the time.
Applicants who have a Bachelors are certified 62.21% of the time and denied 37.79% of the time.
Applicants who only have a high school diploma are denied more than they are certified. They are certified 34.04% of the time and denied 65.96% of the time.
1. Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification?
Education does play a role in Visa Certification.
Doctorate - 87.23%
Masters - 78.63%
Bachelors - 62.21%
High School - 34.04%
stacked_barplot(df2, "continent", "case_status")
case_status Certified Denied All continent All 17018 8462 25480 Asia 11012 5849 16861 North America 2037 1255 3292 Europe 2957 775 3732 South America 493 359 852 Africa 397 154 551 Oceania 122 70 192 ------------------------------------------------------------------------------------------------------------------------
Asia has the highest number of applicants, at 16,861. Of those, 65.31% are certified and 34.69% are denied.
Europe has the highest certification rate of any continent: 79.23% certified and 20.77% denied.
South America has the highest denial rate: 57.86% certified and 42.14% denied.
2. How does the visa status vary across different continents?
Certification rates across continents range from 57.86% (South America) to 79.23% (Europe), while denial rates range from 20.77% to 42.14%.
Africa - Certified 72.05% Denied 27.95%
Asia - Certified 65.31% Denied 34.69%
Europe - Certified 79.23% Denied 20.77%
North America - Certified 61.88% Denied 38.12%
Oceania - Certified 63.54% Denied 36.46%
South America - Certified 57.86% Denied 42.14%
stacked_barplot(df2, "has_job_experience", "case_status")
case_status Certified Denied All has_job_experience All 17018 8462 25480 N 5994 4684 10678 Y 11024 3778 14802 ------------------------------------------------------------------------------------------------------------------------
There is a higher chance of being certified if you have job experience.
If the applicant has job experience they are certified 74.48% and denied 25.52%.
If the applicant has no job experience they are certified 56.13% and denied 43.87%.
3. Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status?
Having work experience appears to have a considerable impact on visa status.
Have Job Experience - Certified 74.48% Denied 25.52%
No Job Experience - Certified 56.13% Denied 43.87%
stacked_barplot(df2, "requires_job_training", "case_status")
case_status Certified Denied All requires_job_training All 17018 8462 25480 N 15012 7513 22525 Y 2006 949 2955 ------------------------------------------------------------------------------------------------------------------------
Whether or not training is required does not seem to make a difference.
No training needed - Certified 66.65% Denied 33.35%
Training required - Certified 67.88% Denied 32.12%
Considering dropping this column since it has little impact on the target variable.
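Before dropping the column, a chi-square test of independence (scipy.stats is already imported above) can check whether the small rate difference is consistent with chance; a sketch using the counts from the crosstab above:

```python
import numpy as np
from scipy import stats

# Contingency table from the crosstab above:
#                 Certified  Denied
# N (no training)   15012     7513
# Y (training)       2006      949
table = np.array([[15012, 7513], [2006, 949]])

chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.3f}")
# A large p-value here supports treating the feature as uninformative
# with respect to case_status.
```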
stacked_barplot(df2, "yr_of_estab", "case_status")
case_status Certified Denied All yr_of_estab All 17018 8462 25480 1998 736 398 1134 2001 656 361 1017 2005 719 332 1051 2007 682 312 994 1999 567 303 870 2000 521 285 806 2004 572 274 846 1997 512 249 761 2006 610 234 844 2010 522 221 743 2003 394 220 614 2009 433 207 640 2008 487 187 674 1993 287 177 464 1994 309 165 474 2012 329 163 492 2011 360 158 518 2013 379 154 533 2002 286 149 435 1996 320 146 466 1995 251 141 392 1989 258 125 383 1984 212 116 328 1985 199 116 315 1986 195 109 304 1977 212 107 319 1987 191 104 295 1968 204 98 302 1992 163 94 257 1979 132 81 213 1981 176 80 256 1988 159 74 233 1991 175 68 243 1990 117 67 184 1978 122 60 182 2014 118 57 175 1975 93 56 149 1969 102 55 157 1976 92 54 146 1838 105 52 157 1982 84 51 135 1983 110 50 160 1947 79 49 128 1971 92 48 140 1974 77 46 123 1980 125 46 171 1963 83 46 129 1855 79 41 120 1962 76 38 114 1960 70 36 106 1946 93 35 128 1911 77 33 110 1970 79 32 111 1973 59 31 90 1948 52 30 82 1965 67 30 97 2015 36 28 64 1972 74 28 102 1954 46 27 73 1885 64 26 90 1949 36 23 59 1939 46 23 69 1928 41 23 64 1966 39 22 61 1920 52 22 74 1950 26 22 48 1884 26 21 47 1925 39 20 59 1961 43 20 63 1902 28 19 47 1869 31 19 50 1888 19 18 37 1817 31 17 48 1945 18 17 35 1926 18 17 35 1953 29 16 45 1952 29 16 45 1964 22 16 38 1929 15 16 31 1847 38 16 54 1956 33 15 48 1909 40 15 55 1873 37 15 52 1934 25 15 40 1912 36 15 51 1896 17 14 31 1919 37 14 51 1899 18 14 32 1872 27 14 41 1900 25 14 39 1957 22 13 35 1959 32 13 45 1890 19 13 32 1967 30 13 43 1839 23 13 36 1907 20 13 33 1853 20 12 32 1932 15 12 27 1875 24 12 36 1958 32 12 44 1940 27 12 39 1897 25 12 37 1892 22 11 33 1923 23 11 34 1906 10 11 21 1880 15 11 26 1818 10 11 21 1868 39 11 50 1913 13 10 23 2016 13 10 23 1863 14 10 24 1933 21 10 31 1908 26 10 36 1904 21 10 31 1851 14 10 24 1881 17 10 27 1898 23 10 33 1944 21 9 30 1941 13 9 22 1951 37 9 46 1916 18 9 27 1938 16 9 25 1935 11 9 20 1931 26 9 35 1858 19 9 28 1843 6 9 15 1870 16 9 25 1914 21 9 30 1936 17 9 26 1887 24 8 32 
1804 10 8 18 1841 4 8 12 1876 19 8 27 1882 14 8 22 1886 20 8 28 1850 27 8 35 1942 9 8 17 1930 21 8 29 1831 16 8 24 1922 20 8 28 1955 13 7 20 1866 10 7 17 1854 18 7 25 1856 14 7 21 1859 13 7 20 1927 10 7 17 1865 14 7 21 1867 25 7 32 1874 8 7 15 1915 16 7 23 1801 6 6 12 1834 7 6 13 1840 12 6 18 1917 16 6 22 1877 13 6 19 1848 18 6 24 1905 10 6 16 1849 12 6 18 1901 16 6 22 1889 36 6 42 1894 5 6 11 1852 13 6 19 1883 11 6 17 1800 11 5 16 1943 13 5 18 1937 18 5 23 1924 21 5 26 1921 6 5 11 1910 10 5 15 1832 2 5 7 1895 17 5 22 1893 11 5 16 1891 13 5 18 1836 5 5 10 1878 17 5 22 1871 20 5 25 1864 17 5 22 1837 9 4 13 1809 5 4 9 1845 8 4 12 1862 6 4 10 1879 12 4 16 1903 12 4 16 1857 9 3 12 1860 4 3 7 1833 8 3 11 1830 3 3 6 1819 14 3 17 1807 4 2 6 1821 6 2 8 1918 12 2 14 1823 5 2 7 1824 0 2 2 1861 9 2 11 1822 3 1 4 1820 5 1 6 1842 5 1 6 1810 3 0 3 1846 4 0 4 ------------------------------------------------------------------------------------------------------------------------
Across individual establishment years, Certified rates span 0% to 100%, as do Denied rates (sparsely populated early years drive the extremes).
The age of the establishment does not appear to make a consistent difference in whether applications are certified.
The oldest company has a Certified rate of 69% and a Denied rate of 31%, whereas the newest company has a Certified rate of 57% and a Denied rate of 43%.
Consider removing this feature.
#Create a yr_of_estab range
df2["yr_of_estab_range"] = pd.cut(
    x=df2.yr_of_estab,
    bins=[-np.inf, 1873, 1923, 1973, 2003, np.inf],
    labels=["Over 150 yrs", "100-149 yrs", "50-99 yrs", "20-49 yrs", "Less than 20 yrs"],
)
df2["yr_of_estab_range"].value_counts()  # creating the yr_of_estab category based on the year values
yr_of_estab_range
20-49 yrs           11829
Less than 20 yrs     7597
50-99 yrs            3117
100-149 yrs          1597
Over 150 yrs         1340
Name: count, dtype: int64
distribution_plot_wrt_target(df2, "yr_of_estab_range", "case_status")
stacked_barplot(df2, "yr_of_estab_range", "case_status")
case_status        Certified  Denied    All
yr_of_estab_range
All                    17018    8462  25480
20-49 yrs               7731    4098  11829
Less than 20 yrs        5260    2337   7597
50-99 yrs               2060    1057   3117
100-149 yrs             1074     523   1597
Over 150 yrs             893     447   1340
------------------------------------------------------------------------------------------------------------------------
Less than 20 yrs - Certified 69.24% Denied 30.76%
20-49 yrs - Certified 65.36% Denied 34.64%
50-99 yrs - Certified 66.09% Denied 33.91%
100-149 yrs - Certified 67.25% Denied 32.75%
Over 150 yrs - Certified 66.64% Denied 33.36%.
There is not much difference between certification rates across the age ranges.
The gap between the highest (69.24%) and lowest (65.36%) certification rate is only 3.88%.
stacked_barplot(df2, "region_of_employment", "case_status")
case_status           Certified  Denied    All
region_of_employment
All                       17018    8462  25480
Northeast                  4526    2669   7195
West                       4100    2486   6586
South                      4913    2104   7017
Midwest                    3253    1054   4307
Island                      226     149    375
------------------------------------------------------------------------------------------------------------------------
The region of employment does not seem to have much impact on whether the visa is certified.
Certified rates range from 60% to 76%, and Denied rates from 24% to 40%.
Island has the lowest Certified rate at 60%, and Midwest has the highest at 76%.
This is another feature to consider removing, since it appears to have little impact on certification.
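As a quick sanity check on that impression (this test is not part of the original notebook), a chi-square test of independence can be run on the region crosstab counts above; scipy is assumed to be available:

```python
# Hypothetical check: is region_of_employment independent of case_status?
# The observed counts are taken from the crosstab printed above.
import numpy as np
from scipy.stats import chi2_contingency

# rows: Northeast, West, South, Midwest, Island; columns: Certified, Denied
observed = np.array([
    [4526, 2669],
    [4100, 2486],
    [4913, 2104],
    [3253, 1054],
    [ 226,  149],
])
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.1f}, p={p:.3g}, dof={dof}")
```

Note that with a sample this large even modest rate differences test as statistically significant, so significance alone does not settle whether the feature is worth keeping for prediction.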
stacked_barplot(df2, "unit_of_wage", "case_status")
case_status   Certified  Denied    All
unit_of_wage
All               17018    8462  25480
Year              16047    6915  22962
Hour                747    1410   2157
Week                169     103    272
Month                55      34     89
------------------------------------------------------------------------------------------------------------------------
Hourly wage has the lowest certified rate of 35%; its denied rate is 65%.
Yearly wage has the highest certified rate of 70%; its denied rate is 30%.
Monthly and weekly wages have the same certified rate of 62%; their denied rate is 38%.
4. In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa?
Yearly paid employees are the most likely to be certified for a Visa. Their certification rate is 70%.
stacked_barplot(df2, "full_time_position", "case_status")
case_status         Certified  Denied    All
full_time_position
All                     17018    8462  25480
Y                       15163    7610  22773
N                        1855     852   2707
------------------------------------------------------------------------------------------------------------------------
Whether the position is full time does not seem to impact whether the visa is certified.
Full-time positions - Certified 67% Denied 33%.
Non-full-time positions - Certified 69% Denied 31%.
This is another feature to consider dropping.
# Create the 'pay_term' column
df2['pay_term'] = np.where(df2['unit_of_wage'] == 'Hour', 'Day', 'Annual')
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   continent              25480 non-null  category
 1   education_of_employee  25480 non-null  category
 2   has_job_experience     25480 non-null  category
 3   requires_job_training  25480 non-null  category
 4   no_of_employees        25480 non-null  int64
 5   yr_of_estab            25480 non-null  int64
 6   region_of_employment   25480 non-null  category
 7   prevailing_wage        25480 non-null  float64
 8   unit_of_wage           25480 non-null  category
 9   full_time_position     25480 non-null  category
 10  case_status            25480 non-null  category
 11  yr_of_estab_range      25480 non-null  category
 12  pay_term               25480 non-null  object
dtypes: category(9), float64(1), int64(2), object(1)
memory usage: 1021.8+ KB
#The data shows that the unit_of_wage 'Hour' is a per-day amount, whereas the other three unit_of_wage categories are annual salary amounts.
# Define conversion factors
CONVERSION_FACTORS = {
    'Hour': 260,  # ~260 weekdays per year (assuming 5 days per week); some employees work less, so actual hours worked would give a better yearly estimate
    'Week': 1,    # already annual
    'Month': 1,   # already annual
    'Year': 1,    # already annual
}
# Function to convert wage to annual salary
def convert_to_annual(row):
    wage = row['prevailing_wage']
    unit = row['unit_of_wage']
    return wage * CONVERSION_FACTORS[unit]

# Apply the conversion function
df2['annual_wage'] = df2.apply(convert_to_annual, axis=1)
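The row-wise apply above works, but the same conversion can be vectorized with `Series.map`, which is typically much faster on 25K rows. A small self-contained sketch (the demo values are illustrative, not from the dataset):

```python
# Vectorized alternative to the apply-based conversion: map each wage unit
# to its conversion factor, then multiply element-wise.
import pandas as pd

demo = pd.DataFrame({
    "prevailing_wage": [592.20, 70000.0, 1200.0],
    "unit_of_wage": ["Hour", "Year", "Week"],
})
factors = {"Hour": 260, "Week": 1, "Month": 1, "Year": 1}
demo["annual_wage"] = demo["prevailing_wage"] * demo["unit_of_wage"].map(factors)
print(demo["annual_wage"].tolist())
```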
#Create an annual wage range
df2["annual_wage_range"] = pd.cut(
    x=df2.annual_wage,
    bins=[-np.inf, 50000, 100000, 150000, np.inf],
    labels=["Below 50K", "50K - 100K", "100K to 150K", "Above 150K"],
)
df2["annual_wage_range"].value_counts()  # creating the annual wage category based on the wage values
annual_wage_range
50K - 100K      9284
Below 50K       7578
100K to 150K    6135
Above 150K      2483
Name: count, dtype: int64
df2[df2["unit_of_wage"]=="Hour"]
| | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status | yr_of_estab_range | pay_term | annual_wage | annual_wage_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Asia | High School | N | N | 14513 | 2007 | West | 592.20 | Hour | Y | Denied | Less than 20 yrs | Day | 153972.75 | Above 150K |
| 7 | North America | Bachelor's | Y | N | 3035 | 1924 | West | 418.23 | Hour | Y | Denied | 50-99 yrs | Day | 108739.75 | 100K to 150K |
| 54 | Asia | Master's | Y | N | 11733 | 1995 | Northeast | 230.81 | Hour | Y | Certified | 20-49 yrs | Day | 60009.87 | 50K - 100K |
| 62 | Asia | High School | N | N | 5110 | 2004 | West | 103.22 | Hour | Y | Denied | Less than 20 yrs | Day | 26837.62 | Below 50K |
| 70 | Asia | High School | Y | N | 1320 | 2001 | Northeast | 230.33 | Hour | Y | Denied | 20-49 yrs | Day | 59885.02 | 50K - 100K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25444 | South America | Master's | Y | N | 1081 | 1838 | Northeast | 156.61 | Hour | Y | Certified | Over 150 yrs | Day | 40717.82 | Below 50K |
| 25450 | Asia | Bachelor's | N | N | 3312 | 2009 | Northeast | 682.10 | Hour | Y | Denied | Less than 20 yrs | Day | 177347.25 | Above 150K |
| 25461 | Asia | Master's | Y | N | 2861 | 2004 | West | 54.92 | Hour | Y | Denied | Less than 20 yrs | Day | 14279.10 | Below 50K |
| 25465 | North America | High School | N | N | 2577 | 1995 | South | 481.22 | Hour | Y | Certified | 20-49 yrs | Day | 125118.19 | 100K to 150K |
| 25470 | North America | Master's | Y | N | 2272 | 1970 | Northeast | 516.41 | Hour | Y | Certified | 50-99 yrs | Day | 134266.63 | 100K to 150K |
2157 rows × 15 columns
distribution_plot_wrt_target(df2, "annual_wage_range", "case_status")
stacked_barplot(df2, "annual_wage_range", "case_status")
case_status        Certified  Denied    All
annual_wage_range
All                    17018    8462  25480
50K - 100K              6308    2976   9284
Below 50K               5144    2434   7578
100K to 150K            4050    2085   6135
Above 150K              1516     967   2483
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df2, "pay_term", "case_status")
case_status  Certified  Denied    All
pay_term
All              17018    8462  25480
Annual           16271    7052  23323
Day                747    1410   2157
------------------------------------------------------------------------------------------------------------------------
5. The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?
Below 50K - Certified 67.88% Denied 32.12%
50K - 100K - Certified 67.94% Denied 32.06%
100K to 150K - Certified 66.01% Denied 33.99%
Above 150K - Certified 61.06% Denied 38.94%
Certified rates range from 61.06% to 67.94%, a spread of only 6.88%. The highest certification rate is for applicants who make 50K-100K; the lowest is for applicants who make above 150K.
You have a lower chance of getting certified if your prevailing wage is paid by the day rather than by the year.
Day - Certified 34.63%, Denied 65.37%
Annual - Certified 69.76%, Denied 30.24%
distribution_plot_wrt_target(df2, "no_of_employees", "case_status")
#Create an employee count range
df2["no_of_employees_range"] = pd.cut(
    x=df2.no_of_employees,
    bins=[-np.inf, 25000, 75000, 100000, np.inf],
    labels=["Less than 25K", "25K and 75K", "75K to 100K", "Above 100K"],
)
df2["no_of_employees_range"].value_counts()  # creating the no_of_employees category based on the employee count
no_of_employees_range
25K and 75K      8581
Above 100K       7559
Less than 25K    5008
75K to 100K      4332
Name: count, dtype: int64
distribution_plot_wrt_target(df2, "no_of_employees_range", "case_status")
stacked_barplot(df2, "no_of_employees_range", "case_status")
case_status            Certified  Denied    All
no_of_employees_range
All                        17018    8462  25480
25K and 75K                 5972    2609   8581
Above 100K                  5195    2364   7559
Less than 25K               2837    2171   5008
75K to 100K                 3014    1318   4332
------------------------------------------------------------------------------------------------------------------------
Less than 25K employees - Certified 56.65% Denied 43.35%
25K - 75K employees - Certified 69.60% Denied 30.40%
75K - 100K employees - Certified 69.58% Denied 30.42%
Above 100K employees - Certified 68.73% Denied 31.27%
Applicants from companies with more than 25K employees have roughly a 69% to 70% certification rate, whereas applicants from companies with fewer than 25K employees have only about a 57% certification rate.
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   continent              25480 non-null  category
 1   education_of_employee  25480 non-null  category
 2   has_job_experience     25480 non-null  category
 3   requires_job_training  25480 non-null  category
 4   no_of_employees        25480 non-null  int64
 5   yr_of_estab            25480 non-null  int64
 6   region_of_employment   25480 non-null  category
 7   prevailing_wage        25480 non-null  float64
 8   unit_of_wage           25480 non-null  category
 9   full_time_position     25480 non-null  category
 10  case_status            25480 non-null  category
 11  yr_of_estab_range      25480 non-null  category
 12  pay_term               25480 non-null  object
 13  annual_wage            25480 non-null  float64
 14  annual_wage_range      25480 non-null  category
 15  no_of_employees_range  25480 non-null  category
dtypes: category(11), float64(2), int64(2), object(1)
memory usage: 1.2+ MB
# outlier detection using boxplots
numeric_columns = df2.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(6, 4, i + 1)
    plt.boxplot(df2[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show();
There are several outliers in the data.
Verify no_of_employees
#show the number of rows with a value of no_of_employees less than 0, which is not a possible value
df2[df2.no_of_employees<0].shape
(33, 16)
#drop the rows with errors shown above and check the remaining number of rows
df2 = df2[df2.no_of_employees>0]
df2.shape
(25447, 16)
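The 33 negative counts may be data-entry sign errors rather than truly unusable rows. An alternative (hypothetical) treatment, not what this notebook does, is to keep the rows and take the absolute value:

```python
# Alternative handling of impossible negative employee counts: treat them as
# sign errors and flip them, instead of dropping the rows.
import pandas as pd

demo = pd.DataFrame({"no_of_employees": [-26, 14513, -50, 3035]})
demo["no_of_employees"] = demo["no_of_employees"].abs()
print(demo["no_of_employees"].tolist())
```

Dropping was chosen here because only 33 of 25,480 rows are affected, so either choice has negligible impact.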
df2["no_of_employees_range"].value_counts()  # recheck the category counts after dropping the invalid rows
no_of_employees_range
25K and 75K      8570
Above 100K       7550
Less than 25K    5001
75K to 100K      4326
Name: count, dtype: int64
# shows statistical information
df2.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| continent | 25447 | 6 | Asia | 16840 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| education_of_employee | 25447 | 4 | Bachelor's | 10220 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| has_job_experience | 25447 | 2 | Y | 14786 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| requires_job_training | 25447 | 2 | N | 22498 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| no_of_employees | 25447.00 | NaN | NaN | NaN | 5674.42 | 22891.84 | 12.00 | 1025.00 | 2112.00 | 3506.50 | 602069.00 |
| yr_of_estab | 25447.00 | NaN | NaN | NaN | 1979.39 | 42.39 | 1800.00 | 1976.00 | 1997.00 | 2005.00 | 2016.00 |
| region_of_employment | 25447 | 5 | Northeast | 7189 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| prevailing_wage | 25447.00 | NaN | NaN | NaN | 74468.28 | 52822.18 | 2.14 | 34039.21 | 70312.50 | 107739.51 | 319210.27 |
| unit_of_wage | 25447 | 4 | Year | 22933 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| full_time_position | 25447 | 2 | Y | 22741 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| case_status | 25447 | 2 | Certified | 17001 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| yr_of_estab_range | 25447 | 5 | 20-49 yrs | 11811 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| pay_term | 25447 | 2 | Annual | 23294 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| annual_wage | 25447.00 | NaN | NaN | NaN | 83557.33 | 52633.76 | 100.00 | 43607.26 | 77296.59 | 113999.87 | 319210.27 |
| annual_wage_range | 25447 | 4 | 50K - 100K | 9275 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| no_of_employees_range | 25447 | 4 | 25K and 75K | 8570 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Asia is the most frequent continent.
Bachelor's is the most frequent education of employees.
More applicants have job experience than not.
More applicants do not need job training than need training.
Northeast is the most frequent region of employment.
Year is the most frequent unit of wage.
More applicants are going to have a full time position than not.
More applicants are certified than not.
More applicants fall in the 50K-100K annual wage range than in any other range.
More applicants are going to a company with a medium range of employees than any other range.
df2['case_status'] = df2['case_status'].apply(lambda x : 1 if x=='Certified' else 0)
df2['case_status'].unique()
[0, 1] Categories (2, int64): [1, 0]
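A vectorized equivalent of the lambda-based encoding above (a sketch with toy values, not the notebook's original cell): the comparison yields a boolean Series, which is then cast to int.

```python
# Encode the target: 1 for Certified, 0 for anything else (here, Denied).
import pandas as pd

s = pd.Series(["Certified", "Denied", "Certified"])
encoded = (s == "Certified").astype(int)
print(encoded.tolist())
```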
df2.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25447 entries, 0 to 25479
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   continent              25447 non-null  category
 1   education_of_employee  25447 non-null  category
 2   has_job_experience     25447 non-null  category
 3   requires_job_training  25447 non-null  category
 4   no_of_employees        25447 non-null  int64
 5   yr_of_estab            25447 non-null  int64
 6   region_of_employment   25447 non-null  category
 7   prevailing_wage        25447 non-null  float64
 8   unit_of_wage           25447 non-null  category
 9   full_time_position     25447 non-null  category
 10  case_status            25447 non-null  category
 11  yr_of_estab_range      25447 non-null  category
 12  pay_term               25447 non-null  object
 13  annual_wage            25447 non-null  float64
 14  annual_wage_range      25447 non-null  category
 15  no_of_employees_range  25447 non-null  category
dtypes: category(11), float64(2), int64(2), object(1)
memory usage: 1.4+ MB
We will use 70% of data for training and 30% for testing.
#create a dataframe of the predictor feature columns
X = df2.drop('case_status',axis=1)
#create a series of the target class (1=Certified, 0=Denied)
Y = df2['case_status']
#generate dummy variables for each categorical variable
X = pd.get_dummies(X, drop_first=True)
#split the data into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1,stratify=Y)
#confirm the split
print("{0:0.2f}% data is in training set".format((len(X_train)/len(df2.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(df2.index)) * 100))
70.00% data is in training set 30.00% data is in test set
#confirm the shape of both data sets and the ratio of classes is the same across both train and test datasets
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print(' ')
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print(' ')
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (17812, 33)
Shape of test set :  (7635, 33)

Percentage of classes in training set:
case_status
1   0.67
0   0.33
Name: proportion, dtype: float64

Percentage of classes in test set:
case_status
1   0.67
0   0.33
Name: proportion, dtype: float64
Model Evaluation criterion
The model can make two kinds of wrong predictions:
1. Predicting an applicant will be certified when the visa is actually denied (false positive).
2. Predicting an applicant will be denied when the visa is actually certified (false negative).
Which case is more important?
Both are important: a false positive wastes review effort on a weak application, while a false negative screens out a deserving applicant and costs the employer needed talent.
How to reduce the losses?
Recall gives the ratio of true positives to actual positives, so high recall implies few false negatives, i.e. a low chance of predicting a deserving applicant as denied. Since both error types matter here, the F1 score, which balances recall and precision, is also a sensible metric to track.
Below we define functions to provide metric scores (i.e., accuracy, recall, and precision) on the train and test datasets and to show the resulting confusion matrices.
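To make these metric definitions concrete, a tiny worked example with illustrative labels (not data from this project):

```python
# Toy example: 5 actual positives; the model catches 4 of them (TP=4, FN=1)
# and wrongly flags 1 negative (FP=1), so recall = precision = 4/5 = 0.8.
from sklearn import metrics

y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
print("recall   :", metrics.recall_score(y_true, y_pred))     # TP / (TP + FN)
print("precision:", metrics.precision_score(y_true, y_pred))  # TP / (TP + FP)
print("F1       :", metrics.f1_score(y_true, y_pred))         # harmonic mean of the two
```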
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual, labels=[1, 0]):
    '''
    model : classifier to predict values of X
    y_actual : ground truth
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=['Predicted - No', 'Predicted - Yes'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='', cmap='cool')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model, flag=True):
    '''
    model : classifier to predict values of X
    '''
    # defining an empty list to store train and test results
    score_list = []

    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)

    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    train_recall = metrics.recall_score(y_train, pred_train)
    test_recall = metrics.recall_score(y_test, pred_test)
    train_precision = metrics.precision_score(y_train, pred_train)
    test_precision = metrics.precision_score(y_test, pred_test)
    train_F1_score = metrics.f1_score(y_train, pred_train)
    test_F1_score = metrics.f1_score(y_test, pred_test)
    score_list.extend((train_acc, test_acc, train_recall, test_recall,
                       train_precision, test_precision, train_F1_score, test_F1_score))

    # If the flag is set to True, the following scores are printed. The default is True.
    if flag:
        print("Accuracy on training set : ", round(train_acc, 4))
        print("Accuracy on test set : ", round(test_acc, 4))
        print("Recall on training set : ", round(train_recall, 4))
        print("Recall on test set : ", round(test_recall, 4))
        print("Precision on training set : ", round(train_precision, 4))
        print("Precision on test set : ", round(test_precision, 4))
        print("F1 score on training set : ", round(train_F1_score, 4))
        print("F1 score on test set : ", round(test_F1_score, 4))

    return score_list  # returning the list with train and test scores
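For reference, scikit-learn's `classification_report` bundles precision, recall, and F1 into one table; a sketch with toy labels (the custom function above is kept because it reports train and test side by side):

```python
# classification_report: per-class precision/recall/F1 plus overall accuracy.
from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(classification_report(y_true, y_pred, digits=4))
```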
We’ll create our model using the DecisionTreeClassifier function with the default 'gini' criterion for splitting.
decisiontree = DecisionTreeClassifier(criterion='gini', random_state=1, class_weight='balanced')
decisiontree.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', random_state=1)
Confusion Matrix -
Applicant was approved and the model predicted approval : True Positive (observed=1,predicted=1)
Applicant was denied and the model predicted approval : False Positive (observed=0,predicted=1)
Applicant was denied and the model predicted denial : True Negative (observed=0,predicted=0)
Applicant was approved and the model predicted denial : False Negative (observed=1,predicted=0)
make_confusion_matrix(decisiontree,y_test)
#Accuracy, recall, precision and F1 score on train and test set
decisiontree_score=get_metrics_score(decisiontree)
Accuracy on training set :  1.0
Accuracy on test set :  0.6678
Recall on training set :  1.0
Recall on test set :  0.7522
Precision on training set :  1.0
Precision on test set :  0.751
F1 score on training set :  1.0
F1 score on test set :  0.7516
The decision tree scores 100% on every metric on the training data.
On the test data it scores about 75% on recall, precision, and F1, and 67% on accuracy.
Since there is a big difference between training and test performance, the model is overfit.
bagging = BaggingClassifier(random_state=1)
bagging.fit(X_train,y_train)
BaggingClassifier(random_state=1)
make_confusion_matrix(bagging,y_test)
#Accuracy, recall, precision and F1 score on train and test set
bagging_score=get_metrics_score(bagging)
Accuracy on training set :  0.9851
Accuracy on test set :  0.6982
Recall on training set :  0.9864
Recall on test set :  0.7726
Precision on training set :  0.9913
Precision on test set :  0.775
F1 score on training set :  0.9888
F1 score on test set :  0.7738
The bagging classifier scores about 99% on every metric on the training data.
On the test data it scores 77% on recall and F1, 78% on precision, and 70% on accuracy.
Since there is a big difference between training and test performance, the model is overfit.
The bagging classifier has higher accuracy, recall, precision, and F1 than the decision tree.
ranfor = RandomForestClassifier(random_state=1)
ranfor.fit(X_train,y_train)
RandomForestClassifier(random_state=1)
make_confusion_matrix(ranfor,y_test)
#Accuracy, recall, precision and F1 score on train and test set
ranfor_score=get_metrics_score(ranfor)
Accuracy on training set :  0.9999
Accuracy on test set :  0.7236
Recall on training set :  1.0
Recall on test set :  0.8432
Precision on training set :  0.9999
Precision on test set :  0.7665
F1 score on training set :  1.0
F1 score on test set :  0.803
The random forest scores nearly 100% on every metric on the training data.
On the test data it scores 84% on recall, 80% on F1, 77% on precision, and 72% on accuracy.
Since there is a big difference between training and test performance, the model is overfit.
The random forest has the highest accuracy, recall, precision, and F1 of any model so far.
abc = AdaBoostClassifier(random_state=1)
abc.fit(X_train,y_train)
AdaBoostClassifier(random_state=1)
make_confusion_matrix(abc,y_test)
#Accuracy, recall, precision and F1 score on train and test set
abc_score=get_metrics_score(abc)
Accuracy on training set :  0.739
Accuracy on test set :  0.734
Recall on training set :  0.8905
Recall on test set :  0.879
Precision on training set :  0.76
Precision on test set :  0.7603
F1 score on training set :  0.8201
F1 score on test set :  0.8153
The AdaBoost is showing 74% on accuracy, 89% on recall, 76% on precision and 82% on the F1 score on the training data.
The AdaBoost is showing 73% on accuracy, 88% on recall, 76% on precision and 82% on the F1 score on the testing data.
The training and test percentages are almost identical.
This indicates the model generalizes well rather than overfitting.
AdaBoost has higher accuracy, recall, and F1 than any of the previous models.
gbc = GradientBoostingClassifier(random_state=1)
gbc.fit(X_train,y_train)
GradientBoostingClassifier(random_state=1)
make_confusion_matrix(gbc,y_test)
#Accuracy, recall, precision and F1 score on train and test set
gbc_score=get_metrics_score(gbc)
Accuracy on training set :  0.7568
Accuracy on test set :  0.7484
Recall on training set :  0.8801
Recall on test set :  0.8677
Precision on training set :  0.7829
Precision on test set :  0.7803
F1 score on training set :  0.8286
F1 score on test set :  0.8217
The Gradient Boosting is showing 76% on accuracy, 88% on recall, 78% on precision and 83% on the F1 score on the training data.
The Gradient Boosting is showing 75% on accuracy, 87% on recall, 78% on precision and 82% on the F1 score on the testing data.
The training and test percentages are almost identical, so the model generalizes well.
Gradient Boosting has higher accuracy, precision, and F1 than any of the previous models.
xgb = XGBClassifier(random_state=1,eval_metric='logloss')
xgb.fit(X_train,y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=None,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
make_confusion_matrix(xgb,y_test)
#Accuracy, recall, precision and F1 score on train and test set
xgb_score=get_metrics_score(xgb)
Accuracy on training set :  0.8452
Accuracy on test set :  0.7269
Recall on training set :  0.9355
Recall on test set :  0.8475
Precision on training set :  0.8484
Precision on test set :  0.7679
F1 score on training set :  0.8898
F1 score on test set :  0.8057
The XGBoost is showing 85% on accuracy, 94% on recall, 85% on precision and 89% on the F1 score on the training data.
The XGBoost is showing 73% on accuracy, 85% on recall, 77% on precision and 81% on the F1 score on the testing data.
The training and test percentages differ across all metrics, indicating overfitting.
XGBoost scores higher than Gradient Boosting on the training data but not on the test data.
Gradient Boosting still has the highest accuracy, precision, and F1 of any model so far.
# defining list of models
models = [decisiontree, bagging, ranfor, abc, gbc, xgb]

# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
F1_score_train = []
F1_score_test = []

# looping through all the models to get the accuracy, recall and precision scores
for model in models:
    j = get_metrics_score(model, False)
    acc_train.append(np.round(j[0], 4))
    acc_test.append(np.round(j[1], 4))
    recall_train.append(np.round(j[2], 4))
    recall_test.append(np.round(j[3], 4))
    precision_train.append(np.round(j[4], 4))
    precision_test.append(np.round(j[5], 4))
    F1_score_train.append(np.round(j[6], 4))
    F1_score_test.append(np.round(j[7], 4))
# Set the display format to show 4 decimal places
pd.options.display.float_format = '{:.4f}'.format
comparison_frame = pd.DataFrame({'Model':['Decision Tree',
'Bagging Classifier',
'Random Forest',
'AdaBoost with default parameters',
'Gradient Boosting with default parameters',
'XGBoost with default parameters'],
'Train_Accuracy': acc_train,
'Test_Accuracy': acc_test,
'Train_Recall': recall_train,
'Test_Recall': recall_test,
'Train_Precision': precision_train,
'Test_Precision': precision_test,
'Train_F1_score': F1_score_train,
'Test_F1_score': F1_score_test})
comparison_frame
| Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1_score | Test_F1_score | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.0000 | 0.6678 | 1.0000 | 0.7522 | 1.0000 | 0.7510 | 1.0000 | 0.7516 |
| 1 | Bagging Classifier | 0.9851 | 0.6982 | 0.9864 | 0.7726 | 0.9913 | 0.7750 | 0.9888 | 0.7738 |
| 2 | Random Forest | 0.9999 | 0.7236 | 1.0000 | 0.8432 | 0.9999 | 0.7665 | 1.0000 | 0.8030 |
| 3 | AdaBoost with default parameters | 0.7390 | 0.7340 | 0.8905 | 0.8790 | 0.7600 | 0.7603 | 0.8201 | 0.8153 |
| 4 | Gradient Boosting with default parameters | 0.7568 | 0.7484 | 0.8801 | 0.8677 | 0.7829 | 0.7803 | 0.8286 | 0.8217 |
| 5 | XGBoost with default parameters | 0.8452 | 0.7269 | 0.9355 | 0.8475 | 0.8484 | 0.7679 | 0.8898 | 0.8057 |
The AdaBoost and Gradient Boosting models show little overfitting and perform nearly equally well on the train and test datasets, with minimal difference between the two; XGBoost overfits somewhat more.
The best performing model is Gradient Boosting. AdaBoost beats it by about 1% on recall, but Gradient Boosting performs better on accuracy, precision, and F1 score.
Hyperparameters tuned for the Decision Tree classifier include max_depth, min_samples_leaf, max_leaf_nodes, and min_impurity_decrease.
# Choose the type of classifier.
dt_tuned = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
param_grid = {'max_depth': np.arange(2, 6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes': [10, 15],
              'min_impurity_decrease': [0.0001, 0.001, 0.01, 0.1, 1]
              }

# Type of scoring used to compare parameter combinations
dt_scorer = metrics.make_scorer(metrics.f1_score)

# Run the randomized search
rand_dt = RandomizedSearchCV(dt_tuned, param_grid, scoring=dt_scorer, cv=5)
rand_rf = rand_dt.fit(X_train, y_train)

# Set dt_tuned to the best combination of parameters
dt_tuned = rand_rf.best_estimator_

# Fit the best algorithm to the data.
dt_tuned.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10,
                       min_impurity_decrease=0.001, random_state=1)
make_confusion_matrix(dt_tuned,y_test)
#Accuracy, recall, precision and F1 score on train and test set
dt_tuned_score=get_metrics_score(dt_tuned)
Accuracy on training set : 0.7307
Accuracy on test set : 0.7274
Recall on training set : 0.9071
Recall on test set : 0.9024
Precision on training set : 0.7452
Precision on test set : 0.7441
F1 score on training set : 0.8182
F1 score on test set : 0.8156
Tuning the decision tree helped: the model improved significantly over the original.
The model performs equally well on the training and test sets.
Tuning has removed the overfitting that was present in the original decision tree model.
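"Performing equally well" can be made concrete as the train-minus-test gap per metric. A small sketch using the tuned decision tree's scores hand-copied from the printout above:

```python
# Tuned decision tree scores, hand-copied from the printout above
train = {"accuracy": 0.7307, "recall": 0.9071, "precision": 0.7452, "f1": 0.8182}
test  = {"accuracy": 0.7274, "recall": 0.9024, "precision": 0.7441, "f1": 0.8156}

# Overfitting shows up as a large positive train-minus-test gap
gaps = {m: round(train[m] - test[m], 4) for m in train}
print(gaps)  # every gap is well under one percentage point
```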
importances = dt_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
[Horizontal bar chart "Feature Importances": relative importance of all 33 features for the tuned decision tree]
"""The importance of features in the tree building
(The importance of a feature is computed as the (normalized) total
reduction of the criterion brought by that feature.)"""
print(pd.DataFrame(dt_tuned.feature_importances_, columns = ["Imp"],
index = X_train.columns).sort_values(by = 'Imp', ascending = False).head(10))
                                      Imp
education_of_employee_High School  0.3969
pay_term_Day                       0.2176
has_job_experience_Y               0.1644
continent_Europe                   0.0666
region_of_employment_Midwest       0.0536
education_of_employee_Master's     0.0463
education_of_employee_Doctorate    0.0399
continent_North America            0.0146
yr_of_estab_range_100-149 yrs      0.0000
yr_of_estab_range_50-99 yrs        0.0000
Education of employee in High School is the most important feature.
Only 8 features were deemed important.
Hyperparameters available for Bagging classifier include:
# Choose the type of classifier.
bag_tuned = BaggingClassifier(random_state=1)
# random search for bagging classifier
param_grid = {'max_samples': [0.7,0.8,0.9,1],
'max_features': [0.7,0.8,0.9,1],
'n_estimators' : [30, 50, 70]
}
#run the randomized search
rand_bag = RandomizedSearchCV(bag_tuned, param_grid, scoring='f1', cv=5, n_jobs=-1, random_state=1)
rand_bag = rand_bag.fit(X_train, y_train)
# Set the clf to the best combination of parameters
bag_tuned = rand_bag.best_estimator_
# Fit the best algorithm to the data
bag_tuned.fit(X_train, y_train)
BaggingClassifier(max_features=0.7, max_samples=0.7, n_estimators=30,
                  random_state=1)
make_confusion_matrix(bag_tuned,y_test)
#Accuracy, recall, precision and F1 score on train and test set
bag_tuned_score=get_metrics_score(bag_tuned)
Accuracy on training set : 0.9885
Accuracy on test set : 0.7191
Recall on training set : 0.9958
Recall on test set : 0.8498
Precision on training set : 0.9872
Precision on test set : 0.7587
F1 score on training set : 0.9915
F1 score on test set : 0.8017
Even after tuning, the bagging model is still overfit.
The model performs better on the test set for accuracy, recall, and F1 score, but it is still not reliable because of the overfitting.
BaggingClassifier does not expose a feature_importances_ attribute.
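While BaggingClassifier has no feature_importances_ attribute of its own, an approximation can be obtained by averaging the importances of its base trees (the default base estimator is a decision tree). A sketch on synthetic stand-in data, since the notebook's X_train is not reproduced here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Stand-in data; in the notebook this would be X_train, y_train
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
bag = BaggingClassifier(n_estimators=30, random_state=1).fit(X, y)

# Average impurity-based importances over the base trees, mapping each
# tree's (possibly subsampled) feature set back to the original indices
importances = np.zeros(X.shape[1])
for est, feats in zip(bag.estimators_, bag.estimators_features_):
    importances[feats] += est.feature_importances_
importances /= len(bag.estimators_)
print(importances.round(3))
```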
Hyperparameters available for Random Forest classifier include:
# Choose the type of classifier.
rf_tuned = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
param_grid = {"n_estimators": np.arange(50, 110, 25),
"min_samples_leaf": np.arange(5, 10),
"min_samples_split": [3, 5, 7],
"max_features": [np.arange(0.3, 0.6, 0.1),"sqrt", "log2"],
"max_samples": np.arange(0.4, 0.7, 0.1),
}
# Run the randomized search
rand_rf = RandomizedSearchCV(rf_tuned, param_grid, scoring='f1', cv=5, n_jobs=-1, random_state=1)
rand_rf = rand_rf.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_tuned = rand_rf.best_estimator_
# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
RandomForestClassifier(max_samples=0.4, min_samples_leaf=9, min_samples_split=3,
                       n_estimators=75, random_state=1)
make_confusion_matrix(rf_tuned,y_test)
#Accuracy, recall, precision and F1 score on train and test set
rf_tuned_score=get_metrics_score(rf_tuned)
Accuracy on training set : 0.7668
Accuracy on test set : 0.7445
Recall on training set : 0.8975
Recall on test set : 0.8779
Precision on training set : 0.7845
Precision on test set : 0.7713
F1 score on training set : 0.8372
F1 score on test set : 0.8211
Tuning the random forest helped: the model improved significantly over the original.
The model performs equally well on the training and test sets.
Tuning has removed the overfitting that was present in the original random forest model.
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
[Horizontal bar chart "Feature Importances": relative importance of all 33 features for the tuned random forest]
"""The importance of features in the tree building
(The importance of a feature is computed as the (normalized) total
reduction of the criterion brought by that feature.)"""
print(pd.DataFrame(rf_tuned.feature_importances_, columns = ["Imp"],
index = X_train.columns).sort_values(by = 'Imp', ascending = False).head(10))
                                      Imp
education_of_employee_High School  0.1816
prevailing_wage                    0.1251
has_job_experience_Y               0.1002
education_of_employee_Master's     0.0782
no_of_employees                    0.0757
annual_wage                        0.0717
yr_of_estab                        0.0625
unit_of_wage_Year                  0.0379
continent_Europe                   0.0332
education_of_employee_Doctorate    0.0331
Education of employee in High School is the most important feature.
Only 31 features were deemed important.
Hyperparameters available for AdaBoost classifier include:
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
param_grid = {
#Let's try different max_depth for base_estimator
"base_estimator":[DecisionTreeClassifier(max_depth=1, random_state=1, class_weight='balanced'),
DecisionTreeClassifier(max_depth=2, random_state=1, class_weight='balanced'),
DecisionTreeClassifier(max_depth=3, random_state=1, class_weight='balanced')],
"n_estimators": np.arange(50,110,25),
"learning_rate":np.arange(0.01,0.1,0.5)
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.f1_score)
# Run the randomized search
rand_abc = RandomizedSearchCV(abc_tuned, param_grid, scoring=acc_scorer,cv=5, n_jobs=-1, random_state=1)
rand_abc = rand_abc.fit(X_train, y_train)
# Set the clf to the best combination of parameters
abc_tuned = rand_abc.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced',
                                                         max_depth=1,
                                                         random_state=1),
                   learning_rate=0.01, n_estimators=75, random_state=1)
make_confusion_matrix(abc_tuned,y_test)
#Accuracy, recall, precision and F1 score on train and test set
abc_tuned_score=get_metrics_score(abc_tuned)
Accuracy on training set : 0.7289
Accuracy on test set : 0.725
Recall on training set : 0.9129
Recall on test set : 0.9063
Precision on training set : 0.7412
Precision on test set : 0.7403
F1 score on training set : 0.8182
F1 score on test set : 0.8149
Tuning the AdaBoost did not make much difference.
Recall went up slightly but Accuracy, Precision and F1 score went down slightly.
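One way to see why extra boosting rounds add little here: AdaBoost exposes staged_predict, which evaluates the ensemble after every round. A sketch on synthetic stand-in data, since the notebook's train/test split is not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stand-in data for X_train / X_test
X, y = make_classification(n_samples=600, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=1)

abc = AdaBoostClassifier(n_estimators=75, random_state=1).fit(Xtr, ytr)

# Test F1 after each boosting round; a curve that flattens early means
# additional estimators contribute little
staged_f1 = [f1_score(yte, pred) for pred in abc.staged_predict(Xte)]
print(round(staged_f1[0], 3), round(staged_f1[-1], 3))
```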
importances = abc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
[Horizontal bar chart "Feature Importances": relative importance of all 33 features for the tuned AdaBoost model]
"""The importance of features in the tree building
(The importance of a feature is computed as the (normalized) total
reduction of the criterion brought by that feature.)"""
print(pd.DataFrame(abc_tuned.feature_importances_, columns = ["Imp"],
index = X_train.columns).sort_values(by = 'Imp', ascending = False).head(10))
                                      Imp
education_of_employee_High School  0.6000
has_job_experience_Y               0.2267
pay_term_Day                       0.1467
education_of_employee_Master's     0.0267
no_of_employees                    0.0000
unit_of_wage_Year                  0.0000
full_time_position_Y               0.0000
yr_of_estab_range_100-149 yrs      0.0000
yr_of_estab_range_50-99 yrs        0.0000
yr_of_estab_range_20-49 yrs        0.0000
Education of employee in High School is the most important feature.
Only 4 features were deemed important.
These are the same top 4 features in all the models so far.
Hyperparameters available for Gradient Boost classifier include:
gbc_init = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
gbc_init.fit(X_train,y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           random_state=1)
make_confusion_matrix(gbc_init,y_test)
#Using above defined function to get accuracy, recall and precision on train and test set
gbc_init_score=get_metrics_score(gbc_init)
Accuracy on training set : 0.7574
Accuracy on test set : 0.7483
Recall on training set : 0.8802
Recall on test set : 0.8687
Precision on training set : 0.7834
Precision on test set : 0.7797
F1 score on training set : 0.829
F1 score on test set : 0.8218
Compared to the Gradient Boosting model with default parameters, using AdaBoost as the init estimator leaves the scores nearly unchanged.
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
# Grid of parameters to choose from
param_grid = {"n_estimators": [50,110,25],
"subsample":[0.7,0.8,0.9, 1],
"max_features":[0.5,0.7,1],
"learning_rate": np.arange(0.01, 0.1, 0.05)}
# Type of scoring used to compare parameter combinations
gbc_tuned_scorer = metrics.make_scorer(metrics.f1_score)
# Run the randomized search
rand_gb = RandomizedSearchCV(gbc_tuned, param_grid, scoring=gbc_tuned_scorer,cv=5, n_jobs=-1, random_state=1)
rand_gb = rand_gb.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned = rand_gb.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           learning_rate=0.06, max_features=0.7,
                           n_estimators=110, random_state=1, subsample=1)
make_confusion_matrix(gbc_tuned,y_test)
#Accuracy, recall, precision and F1 score on train and test set
gbc_tuned_score=get_metrics_score(gbc_tuned)
Accuracy on training set : 0.7549
Accuracy on test set : 0.7492
Recall on training set : 0.8786
Recall on test set : 0.8688
Precision on training set : 0.7816
Precision on test set : 0.7806
F1 score on training set : 0.8273
F1 score on test set : 0.8223
For training, the tuned Gradient Boosting model shows slightly lower accuracy, recall, and F1 score, with precision essentially unchanged; this suggests tuning reduced what little overfitting the original model had.
For testing, the tuned model shows slightly higher accuracy, precision, and F1 score, with recall essentially unchanged.
Training and test performance are about equal.
Overall, the tuned model is a minor improvement.
importances = gbc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
[Horizontal bar chart "Feature Importances": relative importance of all 33 features for the tuned Gradient Boosting model]
"""The importance of features in the tree building
(The importance of a feature is computed as the (normalized) total
reduction of the criterion brought by that feature.)"""
print(pd.DataFrame(gbc_tuned.feature_importances_, columns = ["Imp"],
index = X_train.columns).sort_values(by = 'Imp', ascending = False).head(10))
                                      Imp
education_of_employee_High School  0.2956
has_job_experience_Y               0.1677
pay_term_Day                       0.0925
education_of_employee_Master's     0.0868
education_of_employee_Doctorate    0.0754
continent_Europe                   0.0651
prevailing_wage                    0.0453
region_of_employment_Midwest       0.0400
unit_of_wage_Year                  0.0297
continent_North America            0.0197
Education of employee in High School is the most important feature.
Only 20 features were deemed important.
These are the same top 4 features in all the models so far.
Some of the hyperparameters available for XGBoost classifier include:
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=1,eval_metric='logloss')
# Grid of parameters to choose from
param_grid = {
"n_estimators": np.arange(50,110,25),
"scale_pos_weight":[1,2,5],
"subsample":[0.7,0.9],
"learning_rate":[0.01,0.1,0.05],
"gamma":[1,3],
}
# Type of scoring used to compare parameter combinations
xgb_scorer = metrics.make_scorer(metrics.recall_score)
# Run the randomized search
rand_xgb = RandomizedSearchCV(xgb_tuned, param_grid, scoring=xgb_scorer,cv=5, n_jobs=-1, random_state=1)
rand_xgb = rand_xgb.fit(X_train, y_train)
# Set the clf to the best combination of parameters
xgb_tuned = rand_xgb.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=3, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.05, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=75,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
make_confusion_matrix(xgb_tuned,y_test)
#Accuracy, recall, precision and F1 score on train and test set
xgb_tuned_score=get_metrics_score(xgb_tuned)
Accuracy on training set : 0.6921
Accuracy on test set : 0.684
Recall on training set : 0.9997
Recall on test set : 0.9951
Precision on training set : 0.6846
Precision on test set : 0.6801
F1 score on training set : 0.8127
F1 score on test set : 0.808
The tuned XGBoost model, searched with recall as the scoring metric, achieves near-perfect recall on both the training and test sets, at the cost of lower accuracy, precision, and F1 score than the default model.
The gap between training and test scores has closed, indicating the tuning has removed the overfitting that was present in the default model.
Training and test performance are about equal.
Overall, the tuned model trades precision for recall rather than improving across the board.
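The recall-heavy behavior above follows from scoring the search on recall. A similar tradeoff can also be steered after training by moving the probability threshold; a sketch with a plain sklearn classifier on synthetic stand-in data (the notebook's xgb_tuned is not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Stand-in data and model
X, y = make_classification(n_samples=800, weights=[0.35], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
proba = clf.predict_proba(Xte)[:, 1]

# Lowering the threshold below 0.5 trades precision for recall
results = {}
for thr in (0.5, 0.3):
    pred = (proba >= thr).astype(int)
    results[thr] = (round(recall_score(yte, pred), 3),
                    round(precision_score(yte, pred), 3))
print(results)
```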
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
[Horizontal bar chart "Feature Importances": relative importance of all 33 features for the tuned XGBoost model]
"""The importance of features in the tree building
(The importance of a feature is computed as the (normalized) total
reduction of the criterion brought by that feature.)"""
print(pd.DataFrame(xgb_tuned.feature_importances_, columns = ["Imp"],
index = X_train.columns).sort_values(by = 'Imp', ascending = False).head(10))
                                      Imp
pay_term_Day                       0.2943
education_of_employee_High School  0.2319
has_job_experience_Y               0.0603
education_of_employee_Master's     0.0422
education_of_employee_Doctorate    0.0415
continent_Europe                   0.0310
region_of_employment_Midwest       0.0297
region_of_employment_South         0.0248
unit_of_wage_Year                  0.0226
continent_Asia                     0.0189
Pay term moved up to become the most important feature for this model.
Education of employee at the High School level moved down to the second most important feature.
Only 31 features received non-zero importance.
The same four features top the importance rankings in all the models so far.
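The bar chart rendered earlier can be reproduced from these importances. A minimal, self-contained sketch of such a chart, using a small synthetic dataset and a GradientBoostingClassifier as a stand-in for the tuned XGBoost model:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in data and model; in the notebook these would be X_train and xgb_tuned
X, y = make_classification(n_samples=200, n_features=6, random_state=1)
X_train = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(6)])
model = GradientBoostingClassifier(random_state=1).fit(X_train, y)

# Sort ascending so the most important feature lands at the top of the chart
imp = pd.Series(model.feature_importances_, index=X_train.columns).sort_values()
plt.figure(figsize=(8, 4))
plt.barh(imp.index, imp.values)
plt.xlabel("Relative Importance")
plt.tight_layout()
plt.savefig("feature_importance.png")
```

Any estimator exposing `feature_importances_` (including the notebook's `xgb_tuned`) can be dropped into the same plotting code.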
# defining list of models
models = [dt_tuned, bag_tuned, rf_tuned, abc_tuned, gbc_init, gbc_tuned, xgb_tuned]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
F1_score_train = []
F1_score_test = []
# looping through all the models to get the accuracy, recall, precision, and F1 scores
for model in models:
    j = get_metrics_score(model, False)
    acc_train.append(np.round(j[0], 4))
    acc_test.append(np.round(j[1], 4))
    recall_train.append(np.round(j[2], 4))
    recall_test.append(np.round(j[3], 4))
    precision_train.append(np.round(j[4], 4))
    precision_test.append(np.round(j[5], 4))
    F1_score_train.append(np.round(j[6], 4))
    F1_score_test.append(np.round(j[7], 4))
# Set the display format to show 4 decimal places
pd.options.display.float_format = '{:.4f}'.format
comparison_frame = pd.DataFrame({'Model':['Decision Tree Tuned',
'Bagging Tuned',
'Random Forest Tuned',
'AdaBoost Tuned',
'Gradient Boosting with init=AdaBoost',
'Gradient Boosting Tuned',
'XGBoost Tuned'],
'Train_Accuracy': acc_train,
'Test_Accuracy': acc_test,
'Train_Recall': recall_train,
'Test_Recall': recall_test,
'Train_Precision':precision_train,
'Test_Precision':precision_test,
'Train_F1_score': F1_score_train,
'Test_F1_score': F1_score_test})
comparison_frame
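The helper `get_metrics_score` is defined earlier in the notebook and not shown in this section. A hypothetical version consistent with how the loop above unpacks its eight return values might look like the sketch below; the explicit data parameters are an assumption, since the notebook's version presumably closes over the train/test splits instead:

```python
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def get_metrics_score(model, X_train, y_train, X_test, y_test, verbose=True):
    """Return (train/test accuracy, recall, precision, F1) for a fitted model."""
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    scores = (
        accuracy_score(y_train, pred_train), accuracy_score(y_test, pred_test),
        recall_score(y_train, pred_train), recall_score(y_test, pred_test),
        precision_score(y_train, pred_train), precision_score(y_test, pred_test),
        f1_score(y_train, pred_train), f1_score(y_test, pred_test),
    )
    if verbose:
        # scores[::2] are the train-set metrics, in the order listed above
        for name, val in zip(("Accuracy", "Recall", "Precision", "F1"), scores[::2]):
            print(f"Train {name}: {val:.4f}")
    return scores
```

Passing `False` for `verbose`, as the loop does, suppresses the per-model printout while still returning the tuple.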
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1_score | Test_F1_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree Tuned | 0.7307 | 0.7274 | 0.9071 | 0.9024 | 0.7452 | 0.7441 | 0.8182 | 0.8156 |
| 1 | Bagging Tuned | 0.9885 | 0.7191 | 0.9958 | 0.8498 | 0.9872 | 0.7587 | 0.9915 | 0.8017 |
| 2 | Random Forest Tuned | 0.7668 | 0.7445 | 0.8975 | 0.8779 | 0.7845 | 0.7713 | 0.8372 | 0.8211 |
| 3 | AdaBoost Tuned | 0.7289 | 0.7250 | 0.9129 | 0.9063 | 0.7412 | 0.7403 | 0.8182 | 0.8149 |
| 4 | Gradient Boosting with init=AdaBoost | 0.7574 | 0.7483 | 0.8802 | 0.8687 | 0.7834 | 0.7797 | 0.8290 | 0.8218 |
| 5 | Gradient Boosting Tuned | 0.7549 | 0.7492 | 0.8786 | 0.8688 | 0.7816 | 0.7806 | 0.8273 | 0.8223 |
| 6 | XGBoost Tuned | 0.6921 | 0.6840 | 0.9997 | 0.9951 | 0.6846 | 0.6801 | 0.8127 | 0.8080 |
The XGBoost Tuned model is showing signs of overfitting, with 99.97% train recall and 99.51% test recall. A model that is almost always correct on recall is unlikely; combined with the lowest test precision of all the models, this suggests it is predicting nearly every application as certified.
The best performing tuned model is Gradient Boosting Tuned. There is only about a 3.4-percentage-point difference in test Recall between the Decision Tree Tuned and Gradient Boosting Tuned models, with the Decision Tree performing better; however, Gradient Boosting performs better on Accuracy, Precision, and F1 score.
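One quick way to quantify overfitting is the train-test gap per metric. A sketch using a few recall figures copied from the comparison table above:

```python
import pandas as pd

# Recall figures taken from the comparison table for illustration
frame = pd.DataFrame({
    "Model": ["Decision Tree Tuned", "Bagging Tuned",
              "Gradient Boosting Tuned", "XGBoost Tuned"],
    "Train_Recall": [0.9071, 0.9958, 0.8786, 0.9997],
    "Test_Recall":  [0.9024, 0.8498, 0.8688, 0.9951],
})
frame["Recall_Gap"] = frame["Train_Recall"] - frame["Test_Recall"]
# Bagging Tuned's ~0.146 gap is the clearest memorization signal; XGBoost
# Tuned's near-perfect recall on BOTH splits points instead to the model
# over-predicting the positive (certified) class.
print(frame.sort_values("Recall_Gap", ascending=False))
```

The same gap calculation can be applied to accuracy, precision, and F1 directly on `comparison_frame`.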
# defining list of models
models = [decisiontree, dt_tuned, bagging, bag_tuned, ranfor, rf_tuned, abc, abc_tuned, gbc, gbc_init, gbc_tuned, xgb, xgb_tuned]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
F1_score_train = []
F1_score_test = []
# looping through all the models to get the accuracy, recall, precision, and F1 scores
for model in models:
    j = get_metrics_score(model, False)
    acc_train.append(np.round(j[0], 4))
    acc_test.append(np.round(j[1], 4))
    recall_train.append(np.round(j[2], 4))
    recall_test.append(np.round(j[3], 4))
    precision_train.append(np.round(j[4], 4))
    precision_test.append(np.round(j[5], 4))
    F1_score_train.append(np.round(j[6], 4))
    F1_score_test.append(np.round(j[7], 4))
# Set the display format to show 4 decimal places
pd.options.display.float_format = '{:.4f}'.format
comparison_frame = pd.DataFrame({'Model':['Decision Tree', 'Decision Tree Tuned',
'Bagging Classifier', 'Bagging Tuned',
'Random Forest', 'Random Forest Tuned',
'AdaBoost with default parameters', 'AdaBoost Tuned',
'Gradient Boosting with default parameters', 'Gradient Boosting with init=AdaBoost',
'Gradient Boosting Tuned','XGBoost with default parameters','XGBoost Tuned'],
'Train_Accuracy': acc_train,
'Test_Accuracy': acc_test,
'Train_Recall': recall_train,
'Test_Recall': recall_test,
'Train_Precision':precision_train,
'Test_Precision':precision_test,
'Train_F1_score': F1_score_train,
'Test_F1_score': F1_score_test})
comparison_frame
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1_score | Test_F1_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.0000 | 0.6678 | 1.0000 | 0.7500 | 1.0000 | 0.7510 | 1.0000 | 0.7516 |
| 1 | Decision Tree Tuned | 0.7307 | 0.7274 | 0.9071 | 0.9000 | 0.7452 | 0.7441 | 0.8182 | 0.8156 |
| 2 | Bagging Classifier | 0.9851 | 0.6982 | 0.9864 | 0.7700 | 0.9913 | 0.7750 | 0.9888 | 0.7738 |
| 3 | Bagging Tuned | 0.9885 | 0.7191 | 0.9958 | 0.8500 | 0.9872 | 0.7587 | 0.9915 | 0.8017 |
| 4 | Random Forest | 0.9999 | 0.7236 | 1.0000 | 0.8400 | 0.9999 | 0.7665 | 1.0000 | 0.8030 |
| 5 | Random Forest Tuned | 0.7668 | 0.7445 | 0.8975 | 0.8800 | 0.7845 | 0.7713 | 0.8372 | 0.8211 |
| 6 | AdaBoost with default parameters | 0.7390 | 0.7340 | 0.8905 | 0.8800 | 0.7600 | 0.7603 | 0.8201 | 0.8153 |
| 7 | AdaBoost Tuned | 0.7289 | 0.7250 | 0.9129 | 0.9100 | 0.7412 | 0.7403 | 0.8182 | 0.8149 |
| 8 | Gradient Boosting with default parameters | 0.7568 | 0.7484 | 0.8801 | 0.8700 | 0.7829 | 0.7803 | 0.8286 | 0.8217 |
| 9 | Gradient Boosting with init=AdaBoost | 0.7574 | 0.7483 | 0.8802 | 0.8700 | 0.7834 | 0.7797 | 0.8290 | 0.8218 |
| 10 | Gradient Boosting Tuned | 0.7549 | 0.7492 | 0.8786 | 0.8700 | 0.7816 | 0.7806 | 0.8273 | 0.8223 |
| 11 | XGBoost with default parameters | 0.8452 | 0.7269 | 0.9355 | 0.8500 | 0.8484 | 0.7679 | 0.8898 | 0.8057 |
| 12 | XGBoost Tuned | 0.6921 | 0.6840 | 0.9997 | 1.0000 | 0.6846 | 0.6801 | 0.8127 | 0.8080 |
The XGBoost Tuned model is showing signs of overfitting, with 99.97% train recall and 99.51% test recall. A model that is almost always correct on recall is unlikely, and the tuned model performs worse than the untuned one on test Accuracy and Precision.
The best performing model is Gradient Boosting Tuned. There is only about a 3-percentage-point difference in test Recall between the Decision Tree Tuned and Gradient Boosting Tuned models, with the Decision Tree performing better; however, Gradient Boosting performs better on Accuracy, Precision, and F1 score.
XGBoost Tuned has the highest test Recall, but at 99.51% it demonstrates overfitting rather than genuine predictive power.
Choosing any of the Gradient Boosting models will help OFLC:
"""The importance of features in the tree building
(The importance of a feature is computed as the (normalized) total
reduction of the criterion brought by that feature.)"""
print(pd.DataFrame(gbc_tuned.feature_importances_, columns = ["Imp"],
index = X_train.columns).sort_values(by = 'Imp', ascending = False).head(10))
                                      Imp
education_of_employee_High School  0.2956
has_job_experience_Y               0.1677
pay_term_Day                       0.0925
education_of_employee_Master's     0.0868
education_of_employee_Doctorate    0.0754
continent_Europe                   0.0651
prevailing_wage                    0.0453
region_of_employment_Midwest       0.0400
unit_of_wage_Year                  0.0297
continent_North America            0.0197
"""The importance of features in the tree building
(The importance of a feature is computed as the (normalized) total
reduction of the criterion brought by that feature.)"""
print(pd.DataFrame(abc_tuned.feature_importances_, columns = ["Imp"],
index = X_train.columns).sort_values(by = 'Imp', ascending = False).head(10))
                                      Imp
education_of_employee_High School  0.6000
has_job_experience_Y               0.2267
pay_term_Day                       0.1467
education_of_employee_Master's     0.0267
no_of_employees                    0.0000
unit_of_wage_Year                  0.0000
full_time_position_Y               0.0000
yr_of_estab_range_100-149 yrs      0.0000
yr_of_estab_range_50-99 yrs        0.0000
yr_of_estab_range_20-49 yrs        0.0000
"""The importance of features in the tree building
(The importance of a feature is computed as the (normalized) total
reduction of the criterion brought by that feature.)"""
print(pd.DataFrame(dt_tuned.feature_importances_, columns = ["Imp"],
index = X_train.columns).sort_values(by = 'Imp', ascending = False).head(10))
                                      Imp
education_of_employee_High School  0.3969
pay_term_Day                       0.2176
has_job_experience_Y               0.1644
continent_Europe                   0.0666
region_of_employment_Midwest       0.0536
education_of_employee_Master's     0.0463
education_of_employee_Doctorate    0.0399
continent_North America            0.0146
yr_of_estab_range_100-149 yrs      0.0000
yr_of_estab_range_50-99 yrs        0.0000
All three models show the same top three features, though each assigns them different levels of importance.
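Importances from several models can also be compared side by side in one frame. A sketch with two small stand-in models; in the notebook, `dt_tuned`, `abc_tuned`, and `gbc_tuned` with `X_train.columns` as the index would be used instead:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and models for a self-contained example
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
cols = [f"feat_{i}" for i in range(5)]
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0).fit(X, y),
}
# One column of normalized importances per model, indexed by feature name
imp = pd.DataFrame({name: m.feature_importances_ for name, m in models.items()},
                   index=cols)
# Sort by one model's ranking to see whether the others agree on the top rows
print(imp.sort_values("Gradient Boosting", ascending=False))
```

Agreement between the columns on the top rows is what justifies the "same top three features" observation above.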
#create visualization of tuned decision-tree model
feature_names = list(X.columns)
plt.figure(figsize=(30,10))
tree.plot_tree(dt_tuned,feature_names=feature_names,filled=True,fontsize=12,node_ids=True,class_names=True)
plt.show()
[Output: rendered decision tree for dt_tuned (19 nodes). The root (node #0) splits on education_of_employee_High School; subsequent splits use pay_term_Day, has_job_experience_Y, continent_Europe, region_of_employment_Midwest, education_of_employee_Master's, education_of_employee_Doctorate, and continent_North America.]
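When the plotted tree is hard to read at figure size, `sklearn.tree.export_text` renders the same structure as indented text. A sketch with a small stand-in tree; in the notebook, `dt_tuned` and its `feature_names` list would be passed instead:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in tree so the sketch is self-contained
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
feature_names = [f"feat_{i}" for i in range(4)]
dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each "|---" level is one split; leaves show the predicted class
rules = export_text(dt, feature_names=feature_names)
print(rules)
```

The text dump makes it easy to trace individual decision paths, such as the high-school-education branch at the root of `dt_tuned`.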
Insights
For the Office of Foreign Labor Certification (OFLC), the three most critical components for pre-screening an applicant are:
Education level
Prior job experience
A daily pay term (pay_term_Day)
The top three features are related: a position that only requires a high school diploma most likely requires little to no experience, and these types of positions also usually pay by the day or hour.
Recommendations
To optimize the allocation of limited resources in screening applications, the OFLC can:
These steps aim to streamline the evaluation process and enhance efficiency in resource utilization.
Additional information can be obtained about the employer and employees:
Incorporating this additional information could improve the prioritization of applicants based on certification/denial percentages derived from the model.
The following information was reviewed but found to have minimal impact on the approval of a visa application across all models:
These details may be omitted from the information obtained from the applicant.
%%shell
jupyter nbconvert --to html //'/content/drive/MyDrive/Python_Course/Project_5/DSBA_Project_ET_EasyVisa_Fullcode_V1.ipynb'
[NbConvertApp] Converting notebook ///content/drive/MyDrive/Python_Course/Project_5/DSBA_Project_ET_EasyVisa_Fullcode_V1.ipynb to html [NbConvertApp] Writing 4375855 bytes to /content/drive/MyDrive/Python_Course/Project_5/DSBA_Project_ET_EasyVisa_Fullcode_V1.html